JSC 370: Data Science II

Week 3: EDA and Data Visualization

Goals

  • Load in large datasets stored locally and remotely (GitHub), check memory issues
  • Understand what EDA is (and is not)
  • Build a repeatable EDA workflow
  • Make plots that reveal structure, not just decoration
  • Avoid common visualization pitfalls

Data Science Pipeline

  • We will work on the blue words today


What is Exploratory Data Analysis?

  • EDA = iterative exploration and building modeling intuition
  • Focus on data distributions, missingness, relationships, anomalies
  • Create: summary stats, exploratory plots
  • Output: questions, hypotheses, and next steps

Exploratory Data Analysis

  • EDA is the process of summarizing data

  • It should be the first step in your analysis pipeline, and it serves to:

    • ✅ check data (import issues, outliers, missing values, data errors)
    • 🧹 clean data (fix implausible values, remove missing if necessary, rename, etc.)
    • \(\Sigma\) summary statistics of key variables (univariate and bivariate)
    • 📊 basic plots and graphs

EDA Checklist

    1. Formulate a question
    2. Read in the data
    3. Check the dimensions, header, and footer of the data
    4. Check the variable types in the data
    5. Take a closer look at some/all of the variables
    6. Validate with an external source
    7. Compute summary statistics to answer the initial question
    8. Make exploratory graphs

Case Study

We are going to use a dataset created from NOAA’s National Centers for Environmental Information (NCEI). The data are hourly measurements from weather stations across the continental U.S.


Formulate a Question

It is a good idea to first have a question such as:

  • what weather stations reported the hottest and coldest daily temperatures?
  • what day of the month was on average the hottest?
  • is there covariation between temperature and humidity in my dataset?

Read in the data

There are several ways to read in data (some depend on the type of data you have):

  • pandas.read_csv() for delimited files
  • pandas.read_parquet() for parquet
  • pandas.read_feather() for feather
  • pandas.read_excel() for .xls and .xlsx
  • pyarrow.dataset / dask for larger-than-memory workflows

Read in the data

We’ll use Python packages pandas, numpy, matplotlib, seaborn, plotnine

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from plotnine import ggplot, aes, geom_point, labs

Example: read in local file

If your data are a gzipped delimited file (e.g., met_all.gz), pandas can read it directly.

met = pd.read_csv("met_all.gz", compression="gzip")
met.head(4)
USAFID WBAN year month day hour min lat lon elev ... vis.dist.qc vis.var vis.var.qc temp temp.qc dew.point dew.point.qc atm.press atm.press.qc rh
0 690150 93121 2019 8 1 0 56 34.3 -116.166 696 ... 5 N 5 37.2 5 10.6 5 1009.9 5 19.881266
1 690150 93121 2019 8 1 1 56 34.3 -116.166 696 ... 5 N 5 35.6 5 10.6 5 1010.3 5 21.760980
2 690150 93121 2019 8 1 2 56 34.3 -116.166 696 ... 5 N 5 34.4 5 7.2 5 1010.6 5 18.482122
3 690150 93121 2019 8 1 3 56 34.3 -116.166 696 ... 5 N 5 33.3 5 5.0 5 1011.6 5 16.888622

4 rows × 30 columns


Example: read in a GitHub file

If your data are a gzipped delimited file stored on GitHub, use the raw file URL:

url = "https://raw.githubusercontent.com/JSC370/JSC370-2026/main/data/met_all.gz"
met = pd.read_csv(url, compression="gzip")    
met.head(4)
USAFID WBAN year month day hour min lat lon elev ... vis.dist.qc vis.var vis.var.qc temp temp.qc dew.point dew.point.qc atm.press atm.press.qc rh
0 690150 93121 2019 8 1 0 56 34.3 -116.166 696 ... 5 N 5 37.2 5 10.6 5 1009.9 5 19.881266
1 690150 93121 2019 8 1 1 56 34.3 -116.166 696 ... 5 N 5 35.6 5 10.6 5 1010.3 5 21.760980
2 690150 93121 2019 8 1 2 56 34.3 -116.166 696 ... 5 N 5 34.4 5 7.2 5 1010.6 5 18.482122
3 690150 93121 2019 8 1 3 56 34.3 -116.166 696 ... 5 N 5 33.3 5 5.0 5 1011.6 5 16.888622

4 rows × 30 columns


Check the types of variables in our dataset

We can get the types of data (integer, float, character) from met.info(). This also tells us the memory usage.

met.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2377343 entries, 0 to 2377342
Data columns (total 30 columns):
 #   Column             Dtype  
---  ------             -----  
 0   USAFID             int64  
 1   WBAN               int64  
 2   year               int64  
 3   month              int64  
 4   day                int64  
 5   hour               int64  
 6   min                int64  
 7   lat                float64
 8   lon                float64
 9   elev               int64  
 10  wind.dir           float64
 11  wind.dir.qc        object 
 12  wind.type.code     object 
 13  wind.sp            float64
 14  wind.sp.qc         object 
 15  ceiling.ht         float64
 16  ceiling.ht.qc      int64  
 17  ceiling.ht.method  object 
 18  sky.cond           object 
 19  vis.dist           float64
 20  vis.dist.qc        object 
 21  vis.var            object 
 22  vis.var.qc         object 
 23  temp               float64
 24  temp.qc            object 
 25  dew.point          float64
 26  dew.point.qc       object 
 27  atm.press          float64
 28  atm.press.qc       int64  
 29  rh                 float64
dtypes: float64(10), int64(10), object(10)
memory usage: 544.1+ MB

What counts as “very large” data on a laptop?

On laptops, “very large” usually means it stresses RAM and slows basic operations.

A dataset can be “very large” even with only a few million rows if it has:

  • many columns
  • lots of strings (object dtype)
  • expensive operations (joins, groupbys, sorting)

Practical rule-of-thumb

Think in terms of memory, not just rows × columns.

  • Comfortable: under ~1 GB in RAM
  • Large: ~1–5 GB (needs care: dtypes, filtering early)
  • Very large: > ~5 GB or doesn’t fit in RAM → you need a different approach/tool
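One way to apply this rule before committing to a full load is to read a small sample of rows, measure per-row memory, and extrapolate. A minimal sketch (the CSV here is synthetic and built in memory; in practice you would point pd.read_csv at your file and estimate the row count from the file size or a cheap line count):

```python
import io
import numpy as np
import pandas as pd

# Synthetic CSV standing in for a file on disk (swap in your real path).
rng = np.random.default_rng(1)
rows = [f"{rng.integers(0, 10**6)},{rng.random():.6f}" for _ in range(5_000)]
csv = "a,b\n" + "\n".join(rows)

# Read a small sample and measure bytes per row, index included.
sample = pd.read_csv(io.StringIO(csv), nrows=500)
per_row = sample.memory_usage(deep=True).sum() / len(sample)

# Extrapolate to the full file (here the row count is known; for a real
# file, estimate it before loading).
total_rows = len(rows)
est_gb = per_row * total_rows / 1024**3
print(f"~{per_row:.0f} bytes/row, estimated full size: {est_gb:.6f} GB")
```

If the estimate lands in the “large” or “very large” band, reach for usecols, chunking, or an out-of-core tool before loading.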

Measure size in pandas

  • Calculate the total memory used by the DataFrame (convert bytes -> GB)
met.memory_usage(deep=True).sum() / 1024**3
np.float64(1.401238577440381)

Measure size in pandas

  • Find the biggest memory columns (this time in MB)
(met.memory_usage(deep=True).sort_values(ascending=False) / 1024**2).head(10)
dew.point.qc         113.360548
temp.qc              113.360548
vis.var              113.360548
wind.type.code       113.360548
sky.cond             113.360548
ceiling.ht.method    113.360548
vis.dist.qc          103.298048
vis.var.qc           102.860548
wind.sp.qc            99.797945
wind.dir.qc           85.994595
dtype: float64

Why strings make data “big”

  • In pandas, object columns (strings) are often the memory hog.
  • met.dtypes tells us what type each variable in the dataset has.
met.dtypes
USAFID                 int64
WBAN                   int64
year                   int64
month                  int64
day                    int64
hour                   int64
min                    int64
lat                  float64
lon                  float64
elev                   int64
wind.dir             float64
wind.dir.qc           object
wind.type.code        object
wind.sp              float64
wind.sp.qc            object
ceiling.ht           float64
ceiling.ht.qc          int64
ceiling.ht.method     object
sky.cond              object
vis.dist             float64
vis.dist.qc           object
vis.var               object
vis.var.qc            object
temp                 float64
temp.qc               object
dew.point            float64
dew.point.qc          object
atm.press            float64
atm.press.qc           int64
rh                   float64
dtype: object
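Converting low-cardinality object columns, like the qc flag columns above, to pandas’ category dtype often cuts their memory dramatically. A sketch on a synthetic flag column (the values are invented, but mimic a qc flag with a handful of distinct codes):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a QC flag column: many rows, few unique strings.
rng = np.random.default_rng(0)
flags = pd.Series(rng.choice(["1", "5", "9", "C"], size=1_000_000))

obj_mb = flags.memory_usage(deep=True) / 1024**2
cat_mb = flags.astype("category").memory_usage(deep=True) / 1024**2

print(f"object:   {obj_mb:.1f} MB")
print(f"category: {cat_mb:.1f} MB")
```

Categoricals store one integer code per row plus each unique string once, so the savings grow with the row count. You can also request this at read time with `dtype={"temp.qc": "category"}` in pd.read_csv.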

Big data can be read in more cautiously

If the file is really big (ours is borderline), we could read only the columns we need. This is where having a question in mind is useful. For ours, we probably just want station location (lat, lon), year, month, day, hour, temperature and humidity.

url = "https://raw.githubusercontent.com/JSC370/JSC370-2026/main/data/met_all.gz"
usecols = ["USAFID","lat","lon", "year", "month","day","hour","temp", "rh"]
met_small = pd.read_csv(
    url,
    compression="gzip",
    usecols=usecols
)
met_small.head(4)
USAFID year month day hour lat lon temp rh
0 690150 2019 8 1 0 34.3 -116.166 37.2 19.881266
1 690150 2019 8 1 1 34.3 -116.166 35.6 21.760980
2 690150 2019 8 1 2 34.3 -116.166 34.4 18.482122
3 690150 2019 8 1 3 34.3 -116.166 33.3 16.888622

Big data can be read in more cautiously

Chunking in pandas means reading in a file in pieces so you don’t load the whole dataset into RAM at once.

url = "https://raw.githubusercontent.com/JSC370/JSC370-2026/main/data/met_all.gz"
chunks = pd.read_csv(url, compression="gzip", chunksize=200_000)
n = 0
temp_sum = 0.0
for ch in chunks:
    temp_sum += ch["temp"].sum(skipna=True)
    n += ch["temp"].notna().sum()
temp_sum / n
np.float64(23.590542469664527)

This computes the overall mean temperature without ever storing the full dataset:

  • temp_sum accumulates the sum of temps across chunks.
  • n accumulates how many non-missing temps you saw.
  • temp_sum / n is the mean.

We’re basically doing: \(\bar{x} = \frac{\sum x_i}{\#\{x_i \text{ not missing}\}}\)
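The same streaming pattern can be checked against a full load on data small enough to fit in memory. A self-contained sketch (the CSV is synthetic and built in memory, with ~3% missing temperatures):

```python
import io
import numpy as np
import pandas as pd

# Synthetic two-column CSV; blank temp fields become NaN on read.
rng = np.random.default_rng(0)
temps = rng.normal(23.0, 6.0, size=10_000)
temps[rng.random(10_000) < 0.03] = np.nan
csv = "i,temp\n" + "\n".join(
    f"{i}," if np.isnan(t) else f"{i},{t:.6f}" for i, t in enumerate(temps)
)

# Streaming pass: accumulate the sum and the non-missing count per chunk.
total, n = 0.0, 0
for ch in pd.read_csv(io.StringIO(csv), chunksize=1_000):
    total += ch["temp"].sum()           # Series.sum skips NaN by default
    n += ch["temp"].notna().sum()
chunked_mean = total / n

# Reference: load everything at once and compare.
full_mean = pd.read_csv(io.StringIO(csv))["temp"].mean()
print(f"chunked = {chunked_mean:.6f}, full = {full_mean:.6f}")
```

The two means agree to floating-point precision, since a sum of chunk sums equals the full sum.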


Check the data

Once we have loaded the data we should check the dimensions of the dataset. In pandas, we use met.shape.

print("shape:", met.shape)
print("n_rows:", met.shape[0])
print("n_cols:", met.shape[1])
shape: (2377343, 30)
n_rows: 2377343
n_cols: 30

Check the data: Summary

  • We see that there are 2,377,343 records of hourly temperature in August 2019 from all of the weather stations in the US. The data set has 30 variables.

  • We know that we are supposed to have hourly measurements of weather data for the month of August 2019 for the entire US. We should check that we have all of these components. Let us check:

    • the year, month, hours
    • the range of locations (latitude and longitude)
    • temperature
    • relative humidity
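The checks listed above can be sketched as assertions. A minimal example on a tiny synthetic frame with the same columns (substitute your `met` DataFrame; the bounding-box numbers are rough continental-US limits, chosen for illustration):

```python
import pandas as pd

# Tiny stand-in for the met data with the columns we want to validate.
met_demo = pd.DataFrame({
    "year":  [2019, 2019, 2019, 2019],
    "month": [8, 8, 8, 8],
    "hour":  [0, 6, 12, 23],
    "lat":   [24.6, 34.3, 44.7, 48.9],
    "lon":   [-124.3, -116.2, -91.7, -68.3],
})

assert set(met_demo["year"].unique()) == {2019}   # August 2019 only
assert set(met_demo["month"].unique()) == {8}
assert met_demo["hour"].between(0, 23).all()      # valid hours of the day
# Rough continental-US bounding box for station coordinates
assert met_demo["lat"].between(24, 50).all()
assert met_demo["lon"].between(-125, -66).all()
print("all checks passed")
```

A failing assertion immediately flags which component of the data is off, before any summaries or plots.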

Check the data more closely

We can get a quick summary from met.describe().

met.describe()
USAFID WBAN year month day hour min lat lon elev wind.dir wind.sp ceiling.ht ceiling.ht.qc vis.dist temp dew.point atm.press atm.press.qc rh
count 2.377343e+06 2.377343e+06 2377343.0 2377343.0 2.377343e+06 2.377343e+06 2.377343e+06 2.377343e+06 2.377343e+06 2.377343e+06 1.592053e+06 2.297650e+06 2.256068e+06 2.377343e+06 2.296387e+06 2.317254e+06 2.311055e+06 711069.000000 2.377343e+06 2.310917e+06
mean 7.230995e+05 2.953885e+04 2019.0 8.0 1.599589e+01 1.134106e+01 3.956097e+01 3.794132e+01 -9.214509e+01 4.158152e+02 1.850082e+02 2.459166e+00 1.616648e+04 5.027185e+00 1.492117e+04 2.359054e+01 1.702084e+01 1014.164390 7.728208e+00 7.164138e+01
std 2.156678e+03 3.336921e+04 0.0 0.0 8.917096e+00 6.860069e+00 1.724761e+01 5.171552e+00 1.195324e+01 5.737842e+02 9.230697e+01 2.147395e+00 9.282465e+03 1.249445e+00 3.879577e+03 6.059275e+00 6.206226e+00 4.059265 2.018170e+00 2.275361e+01
min 6.901500e+05 1.160000e+02 2019.0 8.0 1.000000e+00 0.000000e+00 0.000000e+00 2.455000e+01 -1.242900e+02 -1.300000e+01 3.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 0.000000e+00 -4.000000e+01 -3.720000e+01 960.500000 1.000000e+00 8.334298e-01
25% 7.209280e+05 3.706000e+03 2019.0 8.0 8.000000e+00 5.000000e+00 2.300000e+01 3.396700e+01 -9.801800e+01 1.010000e+02 1.200000e+02 0.000000e+00 3.048000e+03 5.000000e+00 1.609300e+04 1.960000e+01 1.380000e+01 1011.800000 5.000000e+00 5.579015e+01
50% 7.227280e+05 1.386000e+04 2019.0 8.0 1.600000e+01 1.100000e+01 5.000000e+01 3.835000e+01 -9.170800e+01 2.520000e+02 1.800000e+02 2.100000e+00 2.200000e+04 5.000000e+00 1.609300e+04 2.350000e+01 1.810000e+01 1014.100000 9.000000e+00 7.655374e+01
75% 7.250900e+05 5.476700e+04 2019.0 8.0 2.400000e+01 1.700000e+01 5.500000e+01 4.194000e+01 -8.298600e+01 4.000000e+02 2.600000e+02 3.600000e+00 2.200000e+04 5.000000e+00 1.609300e+04 2.780000e+01 2.170000e+01 1016.400000 9.000000e+00 9.062916e+01
max 7.268130e+05 9.499800e+04 2019.0 8.0 3.100000e+01 2.300000e+01 5.900000e+01 4.894100e+01 -6.831300e+01 9.999000e+03 3.600000e+02 3.600000e+01 2.200000e+04 9.000000e+00 1.600000e+05 5.600000e+01 3.600000e+01 1059.900000 9.000000e+00 1.000000e+02
  • Note that this doesn’t tell us if there are any missing data!

Check the data more closely

It is a little harder to quickly summarize missing data in pandas, but here’s a way to count missing observations and calculate the percent missing.

missing = met.isna().sum().to_frame("n_missing")
missing["pct_missing"] = (missing["n_missing"] / len(met)).round(4)
missing.sort_values("pct_missing", ascending=False).head(20)
n_missing pct_missing
atm.press 1666274 0.7009
wind.dir 785290 0.3303
ceiling.ht 121275 0.0510
vis.dist 80956 0.0341
wind.sp 79693 0.0335
dew.point 66288 0.0279
rh 66426 0.0279
temp 60089 0.0253
lat 0 0.0000
day 0 0.0000
atm.press.qc 0 0.0000
year 0 0.0000
dew.point.qc 0 0.0000
month 0 0.0000
temp.qc 0 0.0000
vis.var.qc 0 0.0000
vis.var 0 0.0000
vis.dist.qc 0 0.0000
sky.cond 0 0.0000
lon 0 0.0000

Check variables more closely

If we return to our initial question—what weather stations reported the hottest and coldest temperatures, we should take a closer look at our key variable, temperature (temp).

met["temp"].describe()
count    2.317254e+06
mean     2.359054e+01
std      6.059275e+00
min     -4.000000e+01
25%      1.960000e+01
50%      2.350000e+01
75%      2.780000e+01
max      5.600000e+01
Name: temp, dtype: float64
  • It looks like the data are in Celsius.

Check variables more closely

  • It’s odd to have -40°C in August.
  • Subset the rows where temp == -40.0 and select a few columns with loc, an indexer for selecting rows and columns by label.
met_ss = met.loc[met["temp"] == -40.0, ["USAFID","hour", "lat", "lon",  "elev", "temp", "wind.sp"]]
met_ss.shape
met_ss.head()
USAFID hour lat lon elev temp wind.sp
496047 720717 0 29.117 -89.55 36 -40.0 NaN
496048 720717 0 29.117 -89.55 36 -40.0 NaN
496049 720717 0 29.117 -89.55 36 -40.0 NaN
496050 720717 1 29.117 -89.55 36 -40.0 NaN
496051 720717 1 29.117 -89.55 36 -40.0 NaN
  • It appears that all of the -40°C records come from the same site and have missing wind speed, which points to a data issue.

Validate against an external source

We should check outside sources to make sure that our data make sense. The observation with -40°C is suspicious, so we should look up the location of the weather station.

Go to Google Maps and enter the coordinates for the site with -40°C (29.12, -89.55).

  • It doesn’t make much sense to have a -40°C reading in the Gulf of Mexico off the coast of Louisiana.

Filter implausible values

Filter with loc and sort the data with sort_values in one line:

met = met.loc[met["temp"] > -40].sort_values("temp")
met.head(20)
USAFID WBAN year month day hour min lat lon elev ... vis.dist.qc vis.var vis.var.qc temp temp.qc dew.point dew.point.qc atm.press atm.press.qc rh
1203054 722817 3068 2019 8 1 1 56 38.767 -104.300 1838 ... 9 N 5 -17.2 5 NaN 9 NaN 9 NaN
1203127 722817 3068 2019 8 3 11 56 38.767 -104.300 1838 ... 9 N 5 -17.2 5 NaN 9 NaN 9 NaN
1203128 722817 3068 2019 8 3 12 56 38.767 -104.300 1838 ... 9 N 5 -17.2 5 NaN 9 NaN 9 NaN
1203221 722817 3068 2019 8 6 21 56 38.767 -104.300 1838 ... 9 N 5 -17.2 5 NaN 9 NaN 9 NaN
1203224 722817 3068 2019 8 6 22 56 38.767 -104.300 1838 ... 9 N 5 -17.2 5 NaN 9 NaN 9 NaN
1203052 722817 3068 2019 8 1 0 56 38.767 -104.300 1838 ... 9 N 5 -17.2 5 NaN 9 NaN 9 NaN
1203225 722817 3068 2019 8 6 23 3 38.767 -104.300 1838 ... 9 N 5 -17.0 6 NaN 9 NaN 9 NaN
1203220 722817 3068 2019 8 6 21 39 38.767 -104.300 1838 ... 9 N 5 -17.0 6 NaN 9 NaN 9 NaN
1203222 722817 3068 2019 8 6 22 32 38.767 -104.300 1838 ... 9 N 5 -17.0 6 NaN 9 NaN 9 NaN
1203223 722817 3068 2019 8 6 22 41 38.767 -104.300 1838 ... 9 N 5 -17.0 6 NaN 9 NaN 9 NaN
1203053 722817 3068 2019 8 1 1 6 38.767 -104.300 1838 ... 9 N 5 -17.0 5 NaN 9 NaN 9 NaN
1105458 722518 12974 2019 8 12 3 56 27.901 -98.052 78 ... 9 9 9 -17.0 1 NaN 9 1011.6 1 NaN
1105456 722518 12974 2019 8 12 1 56 27.901 -98.052 78 ... 9 9 9 -17.0 1 NaN 9 1010.3 1 NaN
1105454 722518 12974 2019 8 11 23 56 27.901 -98.052 78 ... 9 9 9 -17.0 1 NaN 9 1010.3 1 NaN
2370757 726764 94163 2019 8 27 11 50 44.683 -111.116 2025 ... 5 N 5 -3.0 C -5.0 C NaN 9 86.265368
2370758 726764 94163 2019 8 27 12 10 44.683 -111.116 2025 ... 5 N 5 -3.0 5 -4.0 5 NaN 9 92.910834
2370759 726764 94163 2019 8 27 12 30 44.683 -111.116 2025 ... 5 N 5 -3.0 5 -4.0 5 NaN 9 92.910834
2370760 726764 94163 2019 8 27 12 50 44.683 -111.116 2025 ... 5 N 5 -3.0 C -4.0 C NaN 9 92.910834
252488 720411 137 2019 8 18 12 35 36.422 -105.290 2554 ... 5 N 5 -2.4 5 -3.7 5 NaN 9 90.914749
2370687 726764 94163 2019 8 26 12 30 44.683 -111.116 2025 ... 5 N 5 -2.0 5 -3.0 5 NaN 9 92.966900

20 rows × 30 columns

  • We should do a quick check of (38.767, -104.300) and (27.901, -98.052) to see if the -17°C rows are also implausible and should be removed.
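If those readings also prove implausible, one option is to mask them by station ID plus temperature rather than dropping the stations outright. A sketch on a toy frame (the station IDs come from the table above; the toy rows are invented for illustration):

```python
import pandas as pd

# Toy frame with one row per station of interest.
df = pd.DataFrame({
    "USAFID": [722817, 722518, 690150, 726764],
    "temp":   [-17.2, -17.0, 41.7, -3.0],
})

# Mask only the suspect station/temperature combinations, keeping the
# stations' other (plausible) records.
suspect = [722817, 722518]
mask = df["USAFID"].isin(suspect) & (df["temp"] <= -17.0)
cleaned = df.loc[~mask]
print(len(cleaned))  # 2
```

This keeps valid observations from a station even when some of its readings are bad, which wholesale station removal would not.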

Filter and select

Back to question: what weather stations reported the hottest and coldest daily temperatures?

Compute the min and max hourly temperatures and the associated stations. Filter with loc and find the min and max (these ignore missing values) with idxmin() and idxmax().

min_row = met.loc[met["temp"].idxmin(), ["temp", "USAFID", "day"]]
max_row = met.loc[met["temp"].idxmax(), ["temp", "USAFID", "day"]]
min_row, max_row
(temp       -17.2
 USAFID    722817
 day            1
 Name: 1203054, dtype: object,
 temp        56.0
 USAFID    720267
 day           26
 Name: 42402, dtype: object)
  • Ok, but we need daily temperatures to answer our question!

Grouping and summarizing data

We need to transform our data to answer our initial question at the daily scale. Let’s find daily mean temperature by station and day. Use groupby and agg to calculate the mean and create a new dataset met_daily.

met_daily = (
    met.groupby(["USAFID", "day"], as_index=False)
       .agg(
           temp=("temp", "mean"),
           lat=("lat", "mean"),
           lon=("lon", "mean"),
           rh=("rh", "mean"),
       )
       .sort_values("temp")
)

met_daily.head(), met_daily.tail() 
(       USAFID  day       temp     lat    lon         rh
 20774  722817    3 -17.200000  38.767 -104.3        NaN
 20773  722817    1 -17.133333  38.767 -104.3        NaN
 20775  722817    6 -17.066667  38.767 -104.3        NaN
 43605  726130   11   4.278261  44.270  -71.3  99.540440
 43625  726130   31   4.304348  44.270  -71.3  99.710254,
        USAFID  day       temp        lat         lon         rh
 27043  723805    5  40.975000  34.768000 -114.618000  11.369535
 2345   720339   14  41.000000  32.146000 -111.171000   9.137857
 27042  723805    4  41.183333  34.768000 -114.618000  14.211822
 20623  722787    5  41.357143  33.527000 -112.295000  19.829822
 30     690150   31  41.716667  34.299667 -116.165667   8.676717)

Grouping and summarizing data

What day of the month was the hottest?

min_daily = met_daily.loc[met_daily["temp"].idxmin(), ["temp", "USAFID", "day"]]
max_daily = met_daily.loc[met_daily["temp"].idxmax(), ["temp", "USAFID", "day"]]
min_daily, max_daily
(temp         -17.2
 USAFID    722817.0
 day            3.0
 Name: 20774, dtype: float64,
 temp          41.716667
 USAFID    690150.000000
 day           31.000000
 Name: 30, dtype: float64)
  • The hottest station-day was August 31st at station 690150, with an average daily temperature of 41.7°C.

Looking at two variables at a time

To answer the last question about covariation between temperature and humidity, we can first calculate the correlation:

pearson = met["temp"].corr(met["rh"], method="pearson")
spearman = met["temp"].corr(met["rh"], method="spearman")

pearson, spearman, spearman - pearson
(np.float64(-0.5464705142984516),
 np.float64(-0.5461260667477443),
 np.float64(0.00034444755070728306))

Is it the same for the daily data?

pearson = met_daily["temp"].corr(met_daily["rh"], method="pearson")
spearman = met_daily["temp"].corr(met_daily["rh"], method="spearman")

pearson, spearman, spearman - pearson
(np.float64(-0.25729536769933825),
 np.float64(-0.20801597675598085),
 np.float64(0.0492793909433574))

The correlation is quite different between hourly and daily data. Why?
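One way to see why aggregation can change the correlation: hourly temp and rh share a diurnal cycle that moves them in opposite directions, and daily averaging removes that cycle. A sketch with synthetic data (all parameters invented for illustration):

```python
import numpy as np
import pandas as pd

# One synthetic month of hourly data: temp and rh move oppositely with a
# shared diurnal cycle, plus independent noise.
rng = np.random.default_rng(42)
hours = np.arange(31 * 24)
diurnal = np.sin(2 * np.pi * hours / 24)

df = pd.DataFrame({
    "day":  hours // 24,
    "temp": 25 + 8 * diurnal + rng.normal(0, 1, hours.size),
    "rh":   60 - 20 * diurnal + rng.normal(0, 5, hours.size),
})

hourly_r = df["temp"].corr(df["rh"])              # strongly negative
daily = df.groupby("day")[["temp", "rh"]].mean()
daily_r = daily["temp"].corr(daily["rh"])         # much weaker
print(f"hourly r = {hourly_r:.2f}, daily r = {daily_r:.2f}")
```

Averaging to the daily scale removes the within-day covariation, so only the weaker day-to-day signal remains.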


Exploratory graphs

With exploratory graphs we aim to:

•   debug any issues remaining in the data
•   understand properties of the data
•   look for patterns in the data
•   inform modeling strategies

Exploratory graphs do not need to be perfect; later we’ll look at presentation-ready plots with plotnine, Python’s ggplot equivalent.


Exploratory graphs

What type of data are these types of exploratory graphs good for?

•   histograms 
•   boxplots
•   scatterplots 
•   barplots 
•   lineplots
•   violin plots 
•   maps

Exploratory histogram

Focusing on temperature, let’s look at the distribution of hourly temperature (after removing -40°C) using a histogram.

plt.hist(met["temp"].dropna(), bins=50)
plt.xlabel("Temperature (°C)")
plt.ylabel("Count")
plt.title("Hourly Temperature")
plt.show()

Exploratory histogram

Let’s look at the daily data

plt.hist(met_daily["temp"].dropna(), bins=50)
plt.xlabel("Temperature (°C)")
plt.ylabel("Count")
plt.title("Daily Mean Temperature")
plt.show()


Exploratory boxplot

A boxplot gives us an idea of the quantiles of the distribution and any outliers.

plt.boxplot(met["temp"].dropna(), vert=True)
plt.ylabel("Temperature (°C)")
plt.title("Hourly temperature")
plt.show()


Exploratory boxplot

Often boxplots are better for grouped continuous data, such as temperature by hour (over all days in the dataset).

hours = range(24)
data_by_hour = [met.loc[met["hour"] == h, "temp"].to_numpy() for h in hours]

plt.boxplot(data_by_hour, labels=list(hours), showfliers=False)
plt.xlabel("Hour")
plt.ylabel("Temperature (°C)")
plt.title("Temperature by Hour")
plt.show()


Exploratory violin plot

Often violin plots are helpful for grouped continuous data (they show the shape of the distribution), such as temperature by hour (over all days in the dataset).

plt.violinplot(data_by_hour, showmeans=False, showmedians=True, showextrema=False)
plt.xticks(ticks=np.arange(1, 25), labels=list(hours))
plt.xlabel("Hour")
plt.ylabel("Temperature (°C)")
plt.title("Temperature by Hour")
plt.show()


Exploratory scatterplot

Let’s check covariation with a scatterplot of humidity vs temperature.

plt.scatter(met["temp"], met["rh"])
plt.xlabel("Temperature (°C)")
plt.ylabel("Relative humidity (%)")
plt.title("Temperature vs humidity (sample)")
plt.show()


Exploratory scatterplot

Since there are so many points let’s take a random sample of 10,000 rows using sample and add alpha transparency to see the scatter better. random_state is like setting a seed for reproducibility.

met_s = met.sample(min(10_000, len(met)), random_state=1)

plt.scatter(met_s["temp"], met_s["rh"], alpha=0.2)
plt.xlabel("Temperature (°C)")
plt.ylabel("Relative humidity (%)")
plt.title("Temperature vs humidity (sample)")
plt.show()


Exploratory map


Exploratory map

You may need to run python3 -m pip install geopandas contextily to make these maps.

import geopandas as gpd
import contextily as cx

# keep only rows with coordinates, then build point geometries from them
met_pts = met_daily.dropna(subset=["lat", "lon"]).copy()
gdf = gpd.GeoDataFrame(
    met_pts,
    geometry=gpd.points_from_xy(met_pts["lon"], met_pts["lat"]),
    crs="EPSG:4326",
)

gdf_web = gdf.to_crs(epsg=3857)

ax = gdf_web.plot(markersize=10, alpha=0.6)
cx.add_basemap(ax, source=cx.providers.CartoDB.Positron)
ax.set_axis_off()
plt.show()

Interactive map

  • Slightly more advanced to make an interactive map with folium.
  • You can use leaflet as another (my preferred) option, but it seems to have some problems in quarto.

Interactive map

This is how folium works:

  • select unique lat and lon
  • set up the map with folium.Map
  • add the location points on the map with folium.CircleMarker, add opacity and radius for point size here

Interactive map

import folium

pts = met_daily[["lat", "lon"]].dropna().drop_duplicates()

m = folium.Map(
    location=[pts["lat"].mean(), pts["lon"].mean()],
    zoom_start=4,
    tiles="CartoDB positron"
)

for _, r in pts.iterrows():
    folium.CircleMarker(
        location=[r["lat"], r["lon"]],
        radius=2,
        fill=True,
        fill_opacity=0.8,
        opacity=0.8,
    ).add_to(m)

m

Exploratory map

Let’s now look at where the hottest and coldest places were based on the daily temperatures. We add a layer for cold and a layer for hot.

cold = met_daily.loc[met_daily["temp"].idxmin()]
hot  = met_daily.loc[met_daily["temp"].idxmax()]
cold, hot

center = [met_daily["lat"].mean(), met_daily["lon"].mean()]
m = folium.Map(location=center, zoom_start=4, tiles="CartoDB positron")

# Coldest (blue)
folium.CircleMarker(
    location=[cold["lat"], cold["lon"]],
    radius=10,
    popup=f"Coldest daily mean: {cold['temp']:.2f}°C<br>USAFID={cold['USAFID']}<br>day={cold['day']}",
    color="blue",
    fill=True,
    fill_color="blue",
    fill_opacity=0.9,
).add_to(m)

# Hottest (red)
folium.CircleMarker(
    location=[hot["lat"], hot["lon"]],
    radius=10,
    popup=f"Hottest daily mean: {hot['temp']:.2f}°C<br>USAFID={hot['USAFID']}<br>day={hot['day']}",
    color="red",
    fill=True,
    fill_color="red",
    fill_opacity=0.9,
).add_to(m)

m

Adding a bit more information to exploratory graphs

  • We have seen the concept of layering data in the maps.
  • We can add more information to a lot of different types of plots. For example, a scatterplot shows individual observations, but it can be hard to see relationships.
  • A common EDA move is to add a regression line to summarize how rh changes with temp

plotnine (ggplot-style plots in Python)

If you like ggplot2, plotnine uses the same “add layers with +” workflow:

  • map variables with aes()
  • add geoms (points, smoothers)
  • add labels/themes

import numpy as np
from sklearn.metrics import r2_score
from plotnine import (
    ggplot, aes, geom_point, geom_smooth, labs, theme_bw, annotate, coord_cartesian
)

met_s = met[["temp", "rh"]].dropna().sample(min(10_000, len(met)), random_state=1)

# Fit line: rh = m*temp + b
x = met_s["temp"].to_numpy()
y = met_s["rh"].to_numpy()
m, b = np.polyfit(x, y, 1)

y_hat = m * x + b
r2 = r2_score(y, y_hat)

label = f"rh = {m:.2f}·temp + {b:.2f}\nR² = {r2:.3f}"

# location for the label
x_pos = np.quantile(x, 0.9)
y_pos = np.quantile(y, 0.999)

(
  ggplot(met_s, aes(x="temp", y="rh"))
  + geom_point(alpha=0.2, size=1.0)
  + geom_smooth(method="lm", se=False)
  + annotate("text", x=x_pos, y=y_pos, label=label, ha="left")
  + coord_cartesian(ylim=(0, 100))
  + labs(
      title="Temperature vs humidity (sample) with linear fit",
      x="Temperature (°C)",
      y="Relative humidity (%)"
    )
  + theme_bw()
)


plotnine

(
  ggplot(met_s, aes(x="temp", y="rh"))
  + geom_point(alpha=0.2, size=1.0)
  + geom_smooth(method="loess", se=False)
  + coord_cartesian(ylim=(0, 100))
  + labs(
      title="Temperature vs humidity (sample) with loess fit",
      x="Temperature (°C)",
      y="Relative humidity (%)"
    )
  + theme_bw()
)